{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ur8xi4C7S06n"
   },
   "outputs": [],
   "source": [
    "# Copyright 2019 Google LLC\n",
    "#\n",
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "#     https://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "DHxMX0JAMELh"
   },
   "source": [
    "# **Purchase Prediction with AutoML Tables**\n",
    "\n",
    "<table align=\"left\">\n",
    "  <td>\n",
    "    <a href=\"https://colab.sandbox.google.com/github/GoogleCloudPlatform/python-docs-samples/blob/master/tables/automl/notebooks/purchase_prediction/purchase_prediction.ipynb\">\n",
    "      <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Run in Colab\n",
    "    </a>\n",
    "  </td>\n",
    "  <td>\n",
    "    <a href=\"https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/tables/automl/notebooks/purchase_prediction/purchase_prediction.ipynb\">\n",
    "      <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n",
    "      View on GitHub\n",
    "    </a>\n",
    "  </td>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "tvgnzT1CKxrO"
   },
   "source": [
    "## **Overview**\n",
    "\n",
    "One of the most common use cases in Marketing is to predict the likelihood of conversion. Conversion could be defined by the marketer as taking a certain action like making a purchase, signing up for a free trial, subscribing to a newsletter, etc. Knowing the likelihood that a marketing lead or prospect will ‘convert’ can enable the marketer to target the lead with the right marketing campaign. This could take the form of remarketing, targeted email campaigns, online offers or other treatments.\n",
    "\n",
    "Here we demonstrate how you can use BigQuery and AutoML Tables to build a supervised binary classification model for purchase prediction."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "sukxx8RLSjRr"
   },
   "source": [
    "### **Dataset**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "mmn5rn7kScSt"
   },
   "source": [
    "The model uses a real dataset from the [Google Merchandise store](https://www.googlemerchandisestore.com/) consisting of Google Analytics web sessions.\n",
    "\n",
    "The goal here is to predict the likelihood of a web visitor visiting the online Google Merchandise Store making a purchase on the website during that Google Analytics session. Past web interactions of the user on the store website in addition to information like browser details and geography are used to make this prediction.\n",
    "\n",
    "This is framed as a binary classification model, to label a user during a session as either true (makes a purchase) or false (does not make a purchase). Dataset Details The dataset consists of a set of tables corresponding to Google Analytics sessions being tracked on the Google Merchandise Store. Each table is a single day of GA sessions. More details around the schema can be seen here.\n",
    "\n",
    "You can access the data on BigQuery [here](https://support.google.com/analytics/answer/3437719?hl=en&ref_topic=3416089)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "SLq3FfRa8E8X"
   },
   "source": [
    "### **Costs**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "DzxIfOrB71wl"
   },
   "source": [
    "This tutorial uses billable components of Google Cloud Platform (GCP):\n",
    "\n",
    "* Cloud AI Platform\n",
    "* Cloud Storage\n",
    "* BigQuery\n",
    "* AutoML Tables\n",
    "\n",
    "Learn about [Cloud AI Platform pricing](https://cloud.google.com/ml-engine/docs/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), [BigQuery pricing](https://cloud.google.com/bigquery/pricing) and [AutoML Tables pricing](https://cloud.google.com/automl-tables/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ze4-nDLfK4pw"
   },
   "source": [
    "## Set up your local development environment\n",
    "\n",
    "**If you are using Colab or AI Platform Notebooks**, your environment already meets\n",
    "all the requirements to run this notebook. If you are using **AI Platform Notebook**, make sure the machine configuration type is **4 vCPU, 15 GB RAM** or above. You can skip this step."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "gCuSR8GkAgzl"
   },
   "source": [
    "**Otherwise**, make sure your environment meets this notebook's requirements.\n",
    "You need the following:\n",
    "\n",
    "* The Google Cloud SDK\n",
    "* Git\n",
    "* Python 3\n",
    "* virtualenv\n",
    "* Jupyter notebook running in a virtual environment with Python 3\n",
    "\n",
    "The Google Cloud guide to [Setting up a Python development\n",
    "environment](https://cloud.google.com/python/setup) and the [Jupyter\n",
    "installation guide](https://jupyter.org/install) provide detailed instructions\n",
    "for meeting these requirements. The following steps provide a condensed set of\n",
    "instructions:\n",
    "\n",
    "1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)\n",
    "\n",
    "2. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)\n",
    "\n",
    "3. [Install\n",
    "   virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)\n",
    "   and create a virtual environment that uses Python 3.\n",
    "\n",
    "4. Activate that environment and run `pip install jupyter` in a shell to install\n",
    "   Jupyter.\n",
    "\n",
    "5. Run `jupyter notebook` in a shell to launch Jupyter.\n",
    "\n",
    "6. Open this notebook in the Jupyter Notebook Dashboard."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "BF1j6f9HApxa"
   },
   "source": [
    "## **Set up your GCP project**\n",
    "\n",
    "**The following steps are required, regardless of your notebook environment.**\n",
    "\n",
    "1. [Select or create a GCP project.](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n",
    "\n",
    "2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)\n",
    "\n",
    "3. [Enable the AI Platform APIs and Compute Engine APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)\n",
    "\n",
    "4. [Enable AutoML API.](https://console.cloud.google.com/apis/library/automl.googleapis.com?q=automl)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "i7EUnXsZhAGF"
   },
   "source": [
    "## **PIP Install Packages and dependencies**\n",
    "\n",
    "Install addional dependencies not installed in Notebook environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "n2kLhBBRvdog"
   },
   "outputs": [],
   "source": [
    "! pip install --upgrade --quiet --user google-cloud-automl\n",
    "! pip install --upgrade --quiet --user google-cloud-bigquery\n",
    "! pip install --upgrade --quiet --user google-cloud-storage\n",
    "! pip install --upgrade --quiet --user matplotlib\n",
    "! pip install --upgrade --quiet --user pandas \n",
    "! pip install --upgrade --quiet --user pandas-gbq \n",
    "! pip install --upgrade --quiet --user gcsfs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "kK5JATKPNf3I"
   },
   "source": [
    "**Note:** Try installing using `sudo`, if the above command throw any permission errors."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "f-YlNVLTYXXN"
   },
   "source": [
    "`Restart` the kernel to allow automl_v1beta1 to be imported for Jupyter Notebooks.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "C16j_LPrYbZa"
   },
   "outputs": [],
   "source": [
    "from IPython.core.display import HTML\n",
    "HTML(\"<script>Jupyter.notebook.kernel.restart()</script>\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "tPXmVHerC58T"
   },
   "source": [
    "## **Set up your GCP Project Id**\n",
    "\n",
    "Enter your `Project Id` in the cell below. Then run the  cell to make sure the\n",
    "Cloud SDK uses the right project for all the commands in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "2hI1ChtyvXa4"
   },
   "outputs": [],
   "source": [
    "PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\"}\n",
    "COMPUTE_REGION = \"us-central1\" # Currently only supported region."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "dr--iN2kAylZ"
   },
   "source": [
    "## **Authenticate your GCP account**\n",
    "\n",
    "**If you are using AI Platform Notebooks**, your environment is already\n",
    "authenticated. Skip this step."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "3yyVCJHFSEKG"
   },
   "source": [
    "Otherwise, follow these steps:\n",
    "\n",
    "1. In the GCP Console, go to the [**Create service account key**\n",
    "   page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).\n",
    "\n",
    "2. From the **Service account** drop-down list, select **New service account**.\n",
    "\n",
    "3. In the **Service account name** field, enter a name.\n",
    "\n",
    "4. From the **Role** drop-down list, select\n",
    "   **AutoML > AutoML Admin**,\n",
    "   **Storage > Storage Admin** and **BigQuery > BigQuery Admin**.\n",
    "\n",
    "5. Click *Create*. A JSON file that contains your key downloads to your\n",
    "local environment."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Yt6PhVG0UdF1"
   },
   "source": [
    "**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "q5TeVHKDMOJF"
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "\n",
    "# Upload the downloaded JSON file that contains your key.\n",
    "if 'google.colab' in sys.modules:    \n",
    "  from google.colab import files\n",
    "  keyfile_upload = files.upload()\n",
    "  keyfile = list(keyfile_upload.keys())[0]\n",
    "  %env GOOGLE_APPLICATION_CREDENTIALS $keyfile\n",
    "  ! gcloud auth activate-service-account --key-file $keyfile"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "d1bnPeDVMR5Q"
   },
   "source": [
    "***If you are running the notebook locally***, enter the path to your service account key as the `GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "fsVNKXESYoeQ"
   },
   "outputs": [],
   "source": [
    "# If you are running this notebook locally, replace the string below with the\n",
    "# path to your service account key and run this cell to authenticate your GCP\n",
    "# account.\n",
    "\n",
    "%env GOOGLE_APPLICATION_CREDENTIALS /path/to/service/account\n",
    "! gcloud auth activate-service-account --key-file '/path/to/service/account'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "zgPO1eR3CYjk"
   },
   "source": [
    "## **Create a Cloud Storage bucket**\n",
    "\n",
    "**The following steps are required, regardless of your notebook environment.**\n",
    "\n",
    "When you submit a training job using the Cloud SDK, you upload a Python package\n",
    "containing your training code to a Cloud Storage bucket. AI Platform runs\n",
    "the code from this package. In this tutorial, AI Platform also saves the\n",
    "trained model that results from your job in the same bucket. You can then\n",
    "create an AI Platform model version based on this output in order to serve\n",
    "online predictions.\n",
    "\n",
    "Set the name of your Cloud Storage bucket below. It must be unique across all\n",
    "Cloud Storage buckets. \n",
    "\n",
    "You may also change the `REGION` variable, which is used for operations\n",
    "throughout the rest of this notebook. Make sure to [choose a region where Cloud\n",
    "AI Platform services are\n",
    "available](https://cloud.google.com/ml-engine/docs/tensorflow/regions). You may\n",
    "not use a Multi-Regional Storage bucket for training with AI Platform."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "both",
    "colab": {},
    "colab_type": "code",
    "id": "MzGDU7TWdts_"
   },
   "outputs": [],
   "source": [
    "BUCKET_NAME = \"[your-bucket-name]\" #@param {type:\"string\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "-EcIXiGsCePi"
   },
   "source": [
    "**Only if your bucket doesn't exist**: Run the following cell to create your Cloud Storage bucket. Make sure Storage > Storage Admin role is enabled"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "NIq7R4HZCfIc"
   },
   "outputs": [],
   "source": [
    "! gsutil mb -p $PROJECT_ID -l $COMPUTE_REGION gs://$BUCKET_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ucvCsknMCims"
   },
   "source": [
    "Finally, validate access to your Cloud Storage bucket by examining its contents:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "vhOb7YnwClBb"
   },
   "outputs": [],
   "source": [
    "! gsutil ls -al gs://$BUCKET_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "XoEqT2Y4DJmf"
   },
   "source": [
    "## **Import libraries and define constants**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "wkJe8sD-EoTE"
   },
   "source": [
    "Import relevant packages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Cj-pbWdxEtZM"
   },
   "outputs": [],
   "source": [
    "from __future__ import absolute_import\n",
    "from __future__ import division\n",
    "from __future__ import print_function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "6HT8yR2Cvd0a"
   },
   "outputs": [],
   "source": [
    "# AutoML library.\n",
    "from google.cloud import automl_v1beta1 as automl\n",
    "import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types\n",
    "from google.cloud import bigquery\n",
    "from google.cloud import storage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "YPTWUWT0E32J"
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import datetime\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn import metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "MEqIjz0PFCVO"
   },
   "source": [
    "Populate the following cell with the necessary constants and run it to initialize constants."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "iXC9vCBrGTKE"
   },
   "outputs": [],
   "source": [
    "#@title Constants { vertical-output: true }\n",
    "\n",
    "# A name for the AutoML tables Dataset to create.\n",
    "DATASET_DISPLAY_NAME = 'purchase_prediction' #@param {type: 'string'}\n",
    "# A name for the file to hold the nested data.\n",
    "NESTED_CSV_NAME = 'FULL.csv' #@param {type: 'string'}\n",
    "# A name for the file to hold the unnested data.\n",
    "UNNESTED_CSV_NAME = 'FULL_unnested.csv' #@param {type: 'string'}\n",
    "# A name for the input train data.\n",
    "TRAINING_CSV = 'training_unnested_balanced_FULL' #@param {type: 'string'}\n",
    "# A name for the input validation data.\n",
    "VALIDATION_CSV = 'validation_unnested_FULL' #@param {type: 'string'}\n",
    "# A name for the AutoML tables model to create.\n",
    "MODEL_DISPLAY_NAME = 'model_1' #@param {type:'string'}\n",
    "\n",
    "assert all([\n",
    "    PROJECT_ID,\n",
    "    COMPUTE_REGION,\n",
    "    DATASET_DISPLAY_NAME,\n",
    "    MODEL_DISPLAY_NAME,\n",
    "])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "X6xxcNmOGjtY"
   },
   "source": [
    "Initialize client for AutoML, AutoML Tables, BigQuery and Storage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "0y3EourAGWmf"
   },
   "outputs": [],
   "source": [
    "# Initialize the clients.\n",
    "automl_client = automl.AutoMlClient()\n",
    "tables_client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION)\n",
    "bq_client = bigquery.Client()\n",
    "storage_client = storage.Client()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "xdJykMXDozoP"
   },
   "source": [
    "## **Test the set up**\n",
    "\n",
    "To test whether your project set up and authentication steps were successful, run the following cell to list your datasets in this project.\n",
    "\n",
    "If no dataset has previously imported into AutoML Tables, you shall expect an empty return."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "_dKylOQTpF58"
   },
   "outputs": [],
   "source": [
    "# List the datasets.\n",
    "list_datasets = tables_client.list_datasets()\n",
    "datasets = { dataset.display_name: dataset.name for dataset in list_datasets }\n",
    "datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "dleTdOMaplSM"
   },
   "source": [
    "You can also print the list of your models by running the following cell.\n",
    "\n",
    "If no model has previously trained using AutoML Tables, you shall expect an empty return.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "tMXP6no1pn9p"
   },
   "outputs": [],
   "source": [
    "# List the models.\n",
    "list_models = tables_client.list_models()\n",
    "models = { model.display_name: model.name for model in list_models }\n",
    "models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Z0g-D23HYX9A"
   },
   "source": [
    "##**Transformation and Feature Engineering Functions**\n",
    "\n",
    "The data cleaning and transformation step was by far the most involved. It includes a few sections that create an AutoML tables dataset, pull the Google merchandise store data from BigQuery, transform the data, and save it multiple times to csv files in google cloud storage.\n",
    "\n",
    "The dataset that is made viewable in the AutoML Tables UI. It will eventually hold the training data after that training data is cleaned and transformed.\n",
    "\n",
    "This dataset has only around 1% of its values with a positive label value of True i.e. cases when a transaction was made. This is a class imbalance problem. There are several ways to handle class imbalance. We chose to oversample the positive class by random over sampling. This resulted in an artificial increase in the sessions with the positive label of true transaction value.\n",
    "\n",
    "There were also many columns with either all missing or all constant values. These columns would not add any signal to our model, so we dropped them.\n",
    "\n",
    "There were also columns with NaN rather than 0 values. For instance, rather than having a count of 0, a column might have a null value. So we added code to change some of these null values to 0, specifically in our target column, in which null values were not allowed by AutoML Tables. However, AutoML Tables can handle null values for the features."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "5lqd8kOlYeYx"
   },
   "source": [
    "**Feature Engineering**\n",
    "\n",
    "The dataset had rich information on customer location and behavior; however, it can be improved by performing feature engineering. Moreover, there was a concern about data leakage. The decision to do feature engineering, therefore, had two contributing motivations: remove data leakage without too much loss of useful data, and to improve the signal in our data.\n",
    "\n",
    "**Weekdays**\n",
    "\n",
    "The date seemed like a useful piece of information to include, as it could capture seasonal effects. Unfortunately, we only had one year of data, so seasonality on an annual scale would be difficult (read impossible) to incorporate. Fortunately, we could try and detect seasonal effects on a micro, with perhaps equally informative results. We ended up creating a new column of weekdays out of dates, to denote which day of the week the session was held on. This new feature turned out to have some useful predictive power, when added as a variable into our model.\n",
    "\n",
    "**Data Leakage**\n",
    "\n",
    "The marginal gain from adding a weekday feature, was overshadowed by the concern of data leakage in our training data. In the initial naive models we trained, we got outstanding results. So outstanding that we knew that something must be going on. As it turned out, quite a few features functioned as proxies for the feature we were trying to predict: meaning some of the features we conditioned on to build the model had an almost 1:1 correlation with the target feature. Intuitively, this made sense.\n",
    "\n",
    "One feature that exhibited this behavior was the number of page views a customer made during a session. By conditioning on page views in a session, we could very reliably predict which customer sessions a purchase would be made in. At first this seems like the golden ticket, we can reliably predict whether or not a purchase is made! The catch: the full page view information can only be collected at the end of the session, by which point we would also have whether or not a transaction was made. Seen from this perspective, collecting page views at the same time as collecting the transaction information would make it pointless to predict the transaction information using the page views information, as we would already have both. One solution was to drop page views as a feature entirely. This would safely stop the data leakage, but we would lose some critically useful information. Another solution, (the one we ended up going with), was to track the page view information of all previous sessions for a given customer, and use it to inform the current session. This way, we could use the page view information, but only the information that we would have before the session even began. So we created a new column called previous_views, and populated it with the total count of all previous page views made by the customer in all previous sessions. We then deleted the page views feature, to stop the data leakage.\n",
    "\n",
    "Our rationale for this change can be boiled down to the concise heuristic: only use the information that is available to us on the first click of the session. Applying this reasoning, we performed similar data engineering on other features which we found to be proxies for the label feature. We also refined our objective in the process: For a visit to the Google Merchandise store, what is the probability that a customer will make a purchase, and can we calculate this probability the moment the customer arrives? By clarifying the question, we both made the result more powerful/useful, and eliminated the data leakage that threatened to make the predictive power trivial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "BVIYkceJUjCz"
   },
   "outputs": [],
   "source": [
    "def balanceTable(table):\n",
    "  # class count.\n",
    "  count_class_false, count_class_true = table.totalTransactionRevenue\\\n",
    "                                        .value_counts()\n",
    "\n",
    "  # divide by class.\n",
    "  table_class_false = table[table[\"totalTransactionRevenue\"]==False]\n",
    "  table_class_true = table[table[\"totalTransactionRevenue\"]==True]\n",
    "\n",
    "  # random over-sampling.\n",
    "  table_class_true_over = table_class_true.sample(\n",
    "                          count_class_false, replace=True)\n",
    "  table_test_over = pd.concat([table_class_false, table_class_true_over])\n",
    "  return table_test_over"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "pBMg-NHTUnMU"
   },
   "outputs": [],
   "source": [
    "def partitionTable(table, dt=20170500):\n",
    "  # The automl tables model could be training on future data and implicitly learning about past data in the testing\n",
    "  # dataset, this would cause data leakage. To prevent this, we are training only with the first 9 months of data (table1)\n",
    "  # and doing validation with the last three months of data (table2).\n",
    "  table1 = table[table[\"date\"]<=dt].copy(deep=False)\n",
    "  table2 = table[table[\"date\"]>dt].copy(deep=False)\n",
    "  return table1, table2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "smziJuelUqbC"
   },
   "outputs": [],
   "source": [
    "def N_updatePrevCount(table, new_column, old_column):\n",
    "  table = table.fillna(0)\n",
    "  table[new_column] = 1\n",
    "  table.sort_values(by=['fullVisitorId','date'])\n",
    "  table[new_column] = table.groupby(['fullVisitorId'])[old_column].apply(\n",
    "                        lambda x: x.cumsum())\n",
    "  table.drop([old_column], axis=1, inplace=True)\n",
    "  return table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "vQ4Hlhg2Uu49"
   },
   "outputs": [],
   "source": [
    "def N_updateDate(table):\n",
    "  table['weekday'] = 1\n",
    "  table['date'] = pd.to_datetime(table['date'].astype(str), format='%Y%m%d')\n",
    "  table['weekday'] = table['date'].dt.dayofweek\n",
    "  return table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "anX4rrFSUxlF"
   },
   "outputs": [],
   "source": [
    "def change_transaction_values(table):\n",
    "  table['totalTransactionRevenue'] = table['totalTransactionRevenue'].fillna(0)\n",
    "  table['totalTransactionRevenue'] = table['totalTransactionRevenue'].apply(\n",
    "                                      lambda x: x!=0)\n",
    "  return table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "RRLNtUbfv3pj"
   },
   "outputs": [],
   "source": [
    "def saveTable(table, csv_file_name, bucket_name):\n",
    "  table.to_csv(csv_file_name, index=False)\n",
    "  bucket = storage_client.get_bucket(bucket_name)\n",
    "  blob = bucket.blob(csv_file_name)\n",
    "  blob.upload_from_filename(filename=csv_file_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "T1I1dkSAU73g"
   },
   "source": [
    "##**Getting training data**\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "-qfwBGWIB5Nm"
   },
   "source": [
    "\n",
    "If you are using **Colab** the memory may not be sufficient enough to generate Nested and Unnested data using the queries. In this case, you can directly download the unnested data **FULL_unnested.csv** from [here](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/trial_for_c4m/FULL_unnested.csv) and upload the file manually to GCS bucket that was created in the previous steps `(BUCKET_NAME)`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "swgcbjAGLgsl"
   },
   "source": [
    "*If* you are using **AI Platform Notebook or Local environment**, run the following code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "5CDSXB-Fv3jb"
   },
   "outputs": [],
   "source": [
    "# Save table.\n",
    "query = \"\"\"\n",
    "SELECT\n",
    " date, \n",
    " device, \n",
    " geoNetwork, \n",
    " totals, \n",
    " trafficSource, \n",
    " fullVisitorId \n",
    "FROM \n",
    " `bigquery-public-data.google_analytics_sample.ga_sessions_*`\n",
    "WHERE\n",
    " _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB('2017-08-01', INTERVAL 366 DAY)) AND\n",
    " FORMAT_DATE('%Y%m%d',DATE_SUB('2017-08-01', INTERVAL 1 DAY))\n",
    "\"\"\"\n",
    "df = bq_client.query(query).to_dataframe()\n",
    "print(df.iloc[:3])\n",
    "saveTable(df, NESTED_CSV_NAME, BUCKET_NAME)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "pTHwOgw8ArcA"
   },
   "outputs": [],
   "source": [
    "# Unnest the Data.\n",
    "nested_gcs_uri = 'gs://{}/{}'.format(BUCKET_NAME, NESTED_CSV_NAME)\n",
    "table = pd.read_csv(nested_gcs_uri, low_memory=False)\n",
    "\n",
    "column_names = ['device', 'geoNetwork', 'totals', 'trafficSource']\n",
    "\n",
    "for name in column_names:\n",
    "  print(name)\n",
    "  # Each nested column was saved as the string form of a dict.\n",
    "  # Parse it, then expand its keys into separate columns.\n",
    "  table[name] = table[name].apply(lambda i: dict(eval(i)))\n",
    "  temp = table[name].apply(pd.Series)\n",
    "  table = pd.concat([table, temp], axis=1).drop(name, axis=1)\n",
    "\n",
    "# Drop adwordsClickInfo, which is still a nested record after flattening trafficSource.\n",
    "table.drop(['adwordsClickInfo'], axis=1, inplace=True)\n",
    "saveTable(table, UNNESTED_CSV_NAME, BUCKET_NAME)"
   ]
  },
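  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "The unnesting cell above parses each nested column with `eval`, which executes whatever expression a CSV cell contains. As a minimal sketch of an alternative, assuming every nested cell is the string form of a plain Python dict (which is how pandas writes dict values to CSV), `ast.literal_eval` can parse the same text without running code. `flatten_record` and the sample row are hypothetical names used only for illustration; they are not helpers defined in this notebook.\n",
    "\n",
    "```python\n",
    "import ast\n",
    "\n",
    "def flatten_record(row, nested_columns):\n",
    "    # Copy the scalar columns unchanged.\n",
    "    flat = {k: v for k, v in row.items() if k not in nested_columns}\n",
    "    for name in nested_columns:\n",
    "        # literal_eval accepts only Python literals, so no code can run.\n",
    "        flat.update(ast.literal_eval(row[name]))\n",
    "    return flat\n",
    "\n",
    "# Hypothetical row; str() of a dict mimics how a dict cell is written to CSV.\n",
    "row = {\n",
    "    'date': '20170101',\n",
    "    'device': str({'browser': 'Chrome', 'isMobile': False}),\n",
    "    'totals': str({'hits': 3, 'pageviews': None}),\n",
    "}\n",
    "flat_row = flatten_record(row, ['device', 'totals'])\n",
    "print(flat_row)\n",
    "```\n",
    "\n",
    "Values that are not Python literals, such as a bare `nan`, make `ast.literal_eval` raise `ValueError`, so such cells would need separate handling."
   ]
  },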
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "1UL8YqzdWXeu"
   },
   "source": [
    "### **Run the Transformations**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "JJ84Zs68wN3X"
   },
   "outputs": [],
   "source": [
    "# Run the transformations.\n",
    "unnested_gcs_uri = 'gs://{}/{}'.format(BUCKET_NAME, UNNESTED_CSV_NAME)\n",
    "table = pd.read_csv(unnested_gcs_uri, low_memory=False)\n",
    "\n",
    "consts = ['transactionRevenue', 'transactions', 'adContent', 'browserSize', \n",
    "          'campaignCode', 'cityId', 'flashVersion', 'javaEnabled', 'language', \n",
    "          'latitude', 'longitude', 'mobileDeviceBranding', 'mobileDeviceInfo', \n",
    "          'mobileDeviceMarketingName','mobileDeviceModel','mobileInputSelector',\n",
    "          'networkLocation', 'operatingSystemVersion', 'screenColors', \n",
    "          'screenResolution', 'screenviews', 'sessionQualityDim', \n",
    "          'timeOnScreen', 'visits', 'uniqueScreenviews', 'browserVersion', \n",
    "          'referralPath','fullVisitorId', 'date']\n",
    "\n",
    "table = N_updatePrevCount(table, 'previous_views', 'pageviews')\n",
    "table = N_updatePrevCount(table, 'previous_hits', 'hits')\n",
    "table = N_updatePrevCount(table, 'previous_timeOnSite', 'timeOnSite')\n",
    "table = N_updatePrevCount(table, 'previous_Bounces', 'bounces')\n",
    "\n",
    "table = change_transaction_values(table)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "mTdp0V1wnPer"
   },
   "outputs": [],
   "source": [
    "table1, table2 = partitionTable(table)\n",
    "table1 = N_updateDate(table1)\n",
    "table2 = N_updateDate(table2)\n",
    "\n",
    "table1.drop(consts, axis=1, inplace=True)\n",
    "table2.drop(consts, axis=1, inplace=True)\n",
    "\n",
    "saveTable(table2,'{}.csv'.format(VALIDATION_CSV), BUCKET_NAME)\n",
    "\n",
    "table1 = balanceTable(table1)\n",
    "\n",
    "# training_unnested_FULL.csv = the first 9 months of data.\n",
    "saveTable(table1, '{}.csv'.format(TRAINING_CSV), BUCKET_NAME)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "8ZpdDzvPP3Gr"
   },
   "source": [
    "## **Import Training Data**\n",
    "\n",
    "Select a dataset display name and pass your table source information to create a new dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "SZy-Idpsdn2_"
   },
   "source": [
    "#### **Create Dataset**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ZaKxxQTevuV7"
   },
   "outputs": [],
   "source": [
    "# Create dataset.\n",
    "dataset = tables_client.create_dataset(\n",
    "    dataset_display_name=DATASET_DISPLAY_NAME)\n",
    "dataset_name = dataset.name\n",
    "dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "-6ujokeldxof"
   },
   "source": [
    "#### **Import Data**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "VDcwd-tswNxn"
   },
   "outputs": [],
   "source": [
    "# Read the data source from GCS. \n",
    "dataset_gcs_input_uris = ['gs://{}/{}.csv'.format(BUCKET_NAME, TRAINING_CSV)]\n",
    "\n",
    "import_data_response = tables_client.import_data(\n",
    "    dataset=dataset,\n",
    "    gcs_input_uris=dataset_gcs_input_uris\n",
    ")\n",
    "\n",
    "print('Dataset import operation: {}'.format(import_data_response.operation))\n",
    "\n",
    "# Synchronous check of operation status. Wait until import is done.\n",
    "print('Dataset import response: {}'.format(import_data_response.result()))\n",
    "\n",
    "# Verify the status by checking the example_count field.\n",
    "dataset = tables_client.get_dataset(dataset_name=dataset_name)\n",
    "dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "uXpSJ3T-S1xx"
   },
   "source": [
    "## **Review the specs**\n",
    "Run the following command to see table specs such as row count."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "XQHzt60WwNhI"
   },
   "outputs": [],
   "source": [
    "# List table specs.\n",
    "list_table_specs_response = tables_client.list_table_specs(dataset=dataset)\n",
    "table_specs = [s for s in list_table_specs_response]\n",
    "\n",
    "# List column specs.\n",
    "list_column_specs_response = tables_client.list_column_specs(dataset=dataset)\n",
    "column_specs = {s.display_name: s for s in list_column_specs_response}\n",
    "\n",
    "# Print Features and data_type.\n",
    "features = [(key, data_types.TypeCode.Name(value.data_type.type_code)) \n",
    "            for key, value in column_specs.items()]\n",
    "print('Feature list:\\n')\n",
    "for feature in features:\n",
    "    print(feature[0],':', feature[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "_9AIZL9xTIPV"
   },
   "outputs": [],
   "source": [
    "# Table schema pie chart.\n",
    "type_counts = {}\n",
    "for column_spec in column_specs.values():\n",
    "  type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)\n",
    "  type_counts[type_name] = type_counts.get(type_name, 0) + 1\n",
    "    \n",
    "plt.pie(x=type_counts.values(), labels=type_counts.keys(), autopct='%1.1f%%')\n",
    "plt.axis('equal')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "gOeAP21SWrl1"
   },
   "source": [
    "## **Update dataset: assign a label column and enable nullable columns**\n",
    "AutoML Tables automatically detects your data column type. Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "8g5I3Ua-Sheq"
   },
   "source": [
    "### **Update a column: set to not nullable**\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "pZzF09ogwiu_"
   },
   "outputs": [],
   "source": [
    "# Update column.\n",
    "column_spec_display_name = 'totalTransactionRevenue' #@param {type: 'string'}\n",
    "update_column_response = tables_client.update_column_spec(\n",
    "    dataset=dataset,\n",
    "    column_spec_display_name=column_spec_display_name,\n",
    "    nullable=False,\n",
    ")\n",
    "update_column_response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "KZQftXACy21j"
   },
   "source": [
    "**Tip:** You can use kwarg `type_code='CATEGORY'` in the preceding `update_column_spec(..)` call to convert the column data type from `FLOAT64` to `CATEGORY`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "y1NpM6k7XEDm"
   },
   "source": [
    "### **Update dataset: assign a target column**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "714Fydm8winh"
   },
   "outputs": [],
   "source": [
    "# Assign target column.\n",
    "column_spec_display_name = 'totalTransactionRevenue' #@param {type: 'string'}\n",
    "update_dataset_response = tables_client.set_target_column(\n",
    "    dataset=dataset,\n",
    "    column_spec_display_name=column_spec_display_name,\n",
    ")\n",
    "update_dataset_response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "9jzfkZGVeZUA"
   },
   "source": [
    "## **Creating a model**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Cb7KjMuzXRNq"
   },
   "source": [
    "#### **Train a model**\n",
    "\n",
    "To create the datasets for training, testing, and validation, we first considered what kind of data we were dealing with. The data tracks all customer sessions with the Google Merchandise Store over a year. AutoML Tables performs its own training and testing, and provides a UI for viewing the results. For the training and testing dataset, we therefore used the oversampled, balanced dataset created by the transformations described above. Before balancing, we partitioned the data so that the first 9 months are in one table and the last 3 months are in another. This lets us train and test on an entirely different dataset than the one we use to validate.\n",
    "\n",
    "Moreover, we did not oversample the validation dataset, so as not to bias the data that we ultimately use to judge the success of our model.\n",
    "\n",
    "We divided the sessions along time so that the model is never trained on future data to predict past data. (This can also be avoided by including a datetime variable in the dataset and toggling a button in the UI.)\n",
    "\n",
    "Training the model may take one hour or more. The cell that calls `create_model_response.result()` keeps running until training is done. If your Colab session times out, use `tables_client.list_models()` to check whether your model has been created, then use the model name to continue with the next steps. Run the following command to retrieve your model, replacing `model_name` with its actual value.\n",
    "\n",
    "    model = tables_client.get_model(model_name=model_name)\n",
    "\n",
    "Note that we train on the first 9 months of data and validate using the last 3.\n",
    "\n",
    "For demonstration purposes, the following command sets the budget to 1 node hour (`'train_budget_milli_node_hours': 1000`). You can increase that number up to a maximum of 72 node hours (`'train_budget_milli_node_hours': 72000`) for the best model performance.\n",
    "\n",
    "Even with a budget of 1 node hour (the minimum possible budget), training a model can take longer than the specified node hours.\n",
    "\n",
    "You can also select the objective to optimize during model training by setting `optimization_objective`. This solution uses the default optimization objective. See the [documentation](https://cloud.google.com/automl-tables/docs/train#opt-obj) for more details."
   ]
  },
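  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "The split-then-balance order described above can be sketched in plain Python. This is an illustration only, not the implementation of the notebook's `partitionTable` or `balanceTable` helpers. The functions `split_by_date` and `oversample_minority`, the `purchased` label, and the sample sessions are hypothetical, and random duplication is assumed as the oversampling method.\n",
    "\n",
    "```python\n",
    "import random\n",
    "from datetime import date\n",
    "\n",
    "def split_by_date(rows, cutoff):\n",
    "    # Rows dated before the cutoff train the model; the rest validate it.\n",
    "    train = [r for r in rows if r['date'] < cutoff]\n",
    "    validation = [r for r in rows if r['date'] >= cutoff]\n",
    "    return train, validation\n",
    "\n",
    "def oversample_minority(rows, label_key, seed=0):\n",
    "    # Duplicate randomly chosen minority-class rows until both classes match in size.\n",
    "    rng = random.Random(seed)\n",
    "    positives = [r for r in rows if r[label_key]]\n",
    "    negatives = [r for r in rows if not r[label_key]]\n",
    "    minority, majority = sorted([positives, negatives], key=len)\n",
    "    if not minority:\n",
    "        return list(rows)\n",
    "    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]\n",
    "    return majority + minority + extra\n",
    "\n",
    "# Hypothetical data: one session per month from August 2016 through July 2017.\n",
    "sessions = []\n",
    "for i in range(12):\n",
    "    year_offset, month_index = divmod(7 + i, 12)\n",
    "    sessions.append({'date': date(2016 + year_offset, month_index + 1, 1),\n",
    "                     'purchased': i % 4 == 0})\n",
    "\n",
    "# First 9 months for training, last 3 for validation.\n",
    "train, validation = split_by_date(sessions, cutoff=date(2017, 5, 1))\n",
    "# Only the training partition is balanced; validation keeps its natural class ratio.\n",
    "balanced_train = oversample_minority(train, 'purchased')\n",
    "print(len(train), len(validation), len(balanced_train))\n",
    "```\n",
    "\n",
    "Splitting before balancing matters: duplicated rows then never leak from the training partition into the validation partition."
   ]
  },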
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "HB3ZX_BMwiep"
   },
   "outputs": [],
   "source": [
    "# The number of hours to train the model.\n",
    "model_train_hours = 1 #@param {type:'integer'}\n",
    "\n",
    "create_model_response = tables_client.create_model(\n",
    "    MODEL_DISPLAY_NAME,\n",
    "    dataset=dataset,\n",
    "    train_budget_milli_node_hours=model_train_hours*1000,\n",
    ")\n",
    "\n",
    "operation_id = create_model_response.operation.name\n",
    "\n",
    "print('Create model operation: {}'.format(create_model_response.operation))"
   ]
  },
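  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "The training cell above converts hours to milli node hours by multiplying by 1000. As a small sketch, a guard like the following could reject a budget outside the 1 to 72 node hour range described earlier before the request is sent. The helper `to_milli_node_hours` and the two constants are hypothetical names introduced here; they are not part of the AutoML Tables client.\n",
    "\n",
    "```python\n",
    "MIN_NODE_HOURS = 1   # minimum budget stated in this notebook\n",
    "MAX_NODE_HOURS = 72  # maximum budget stated in this notebook\n",
    "\n",
    "def to_milli_node_hours(hours):\n",
    "    # Reject budgets outside the documented range before converting.\n",
    "    if not MIN_NODE_HOURS <= hours <= MAX_NODE_HOURS:\n",
    "        raise ValueError(\n",
    "            'Budget must be between {} and {} node hours, got {}'.format(\n",
    "                MIN_NODE_HOURS, MAX_NODE_HOURS, hours))\n",
    "    return int(hours * 1000)\n",
    "\n",
    "print(to_milli_node_hours(1))\n",
    "print(to_milli_node_hours(72))\n",
    "```\n",
    "\n",
    "The result could then be passed as `train_budget_milli_node_hours` in place of `model_train_hours*1000`."
   ]
  },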
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "y3J0reWbTsrW"
   },
   "outputs": [],
   "source": [
    "# Wait until model training is done.\n",
    "model = create_model_response.result()\n",
    "model_name = model.name\n",
    "model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "s9rUSDDQXse3"
   },
   "source": [
    "## **Make a prediction**\n",
    "In this section, we take the prediction results for our validation data and plot the precision-recall curve and the ROC curve for both the false and true predictions.\n",
    "\n",
    "There are two prediction modes: online and batch. The following cell shows how to make a batch prediction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "both",
    "colab": {},
    "colab_type": "code",
    "id": "OJ3DPwzkwiOe"
   },
   "outputs": [],
   "source": [
    "#@title Start batch prediction { vertical-output: true }\n",
    "\n",
    "batch_predict_gcs_input_uris = ['gs://{}/{}.csv'.format(BUCKET_NAME, VALIDATION_CSV)] #@param {type:'string'}\n",
    "batch_predict_gcs_output_uri_prefix = 'gs://{}'.format(BUCKET_NAME) #@param {type:'string'}\n",
    "\n",
    "batch_predict_response = tables_client.batch_predict(\n",
    "    model=model, \n",
    "    gcs_input_uris=batch_predict_gcs_input_uris,\n",
    "    gcs_output_uri_prefix=batch_predict_gcs_output_uri_prefix,\n",
    ")\n",
    "print('Batch prediction operation: {}'.format(batch_predict_response.operation))\n",
    "\n",
    "# Wait until batch prediction is done.\n",
    "batch_predict_result = batch_predict_response.result()\n",
    "batch_predict_response.metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "S4aNtFCPX9Ew"
   },
   "source": [
    "## **Evaluate your prediction**\n",
    "The following cells create a precision-recall curve and an ROC curve for both the true and false classifications."
   ]
  },
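  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "Each point on these curves comes from choosing one score threshold and counting the resulting true and false positives. The sketch below computes precision, recall, and false positive rate at a single threshold in plain Python. The function `rates_at_threshold` and the sample labels and scores are hypothetical. Treating precision as 1.0 when nothing is predicted positive is an assumed convention.\n",
    "\n",
    "```python\n",
    "def rates_at_threshold(labels, scores, threshold):\n",
    "    # Count the four confusion-matrix cells for predictions `score >= threshold`.\n",
    "    tp = fp = fn = tn = 0\n",
    "    for label, score in zip(labels, scores):\n",
    "        predicted = score >= threshold\n",
    "        if predicted and label:\n",
    "            tp += 1\n",
    "        elif predicted:\n",
    "            fp += 1\n",
    "        elif label:\n",
    "            fn += 1\n",
    "        else:\n",
    "            tn += 1\n",
    "    precision = tp / (tp + fp) if tp + fp else 1.0\n",
    "    recall = tp / (tp + fn) if tp + fn else 0.0\n",
    "    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0\n",
    "    return precision, recall, false_positive_rate\n",
    "\n",
    "# Hypothetical labels and scores for four sessions.\n",
    "labels = [True, False, True, False]\n",
    "scores = [0.9, 0.6, 0.4, 0.1]\n",
    "precision, recall, false_positive_rate = rates_at_threshold(labels, scores, 0.5)\n",
    "print(precision, recall, false_positive_rate)\n",
    "```\n",
    "\n",
    "Sweeping the threshold over all observed scores and plotting recall against precision gives a precision-recall curve; plotting false positive rate against recall (the true positive rate) gives an ROC curve."
   ]
  },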
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "IOeudrAvdreq"
   },
   "outputs": [],
   "source": [
    "def invert(x):\n",
    "  # Return the complementary score, 1 - x.\n",
    "  return 1 - x\n",
    "\n",
    "def switch_label(x):\n",
    "  # Return the logical negation of a label.\n",
    "  return not x"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "OdtcQU5kVkem"
   },
   "outputs": [],
   "source": [
    "batch_predict_results_location = batch_predict_response.metadata\\\n",
    "                                 .batch_predict_details.output_info\\\n",
    "                                 .gcs_output_directory\n",
    "table = pd.read_csv('{}/tables_1.csv'.format(batch_predict_results_location))\n",
    "y = table[\"totalTransactionRevenue\"]\n",
    "scores = table[\"totalTransactionRevenue_True_score\"]\n",
    "scores_invert = table['totalTransactionRevenue_False_score']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "_tYEgv_IeL3T"
   },
   "outputs": [],
   "source": [
    "# code for ROC curve, for true values.\n",
    "fpr, tpr, thresholds = metrics.roc_curve(y, scores)\n",
    "roc_auc = metrics.auc(fpr, tpr)\n",
    "plt.figure()\n",
    "lw = 2\n",
    "plt.plot(fpr, tpr, color='darkorange',\n",
    "         lw=lw, label='ROC curve (area=%0.2f)' % roc_auc)\n",
    "plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')\n",
    "plt.xlim([0.0, 1.0])\n",
    "plt.ylim([0.0, 1.05])\n",
    "plt.xlabel('False Positive Rate')\n",
    "plt.ylabel('True Positive Rate')\n",
    "plt.title('Receiver operating characteristic for True')\n",
    "plt.legend(loc=\"lower right\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "RAWpzQjReQxk"
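   },
   "outputs": [],
   "source": [
    "# code for ROC curve, for false values.\n",
    "plt.figure()\n",
    "lw = 2\n",
    "label_invert = y.apply(switch_label)\n",
    "fpr, tpr, thresholds = metrics.roc_curve(label_invert, scores_invert)\n",
    "# Recompute the AUC for this curve rather than reusing the value from the previous cell.\n",
    "roc_auc = metrics.auc(fpr, tpr)\n",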
    "plt.plot(fpr, tpr, color='darkorange',\n",
    "         lw=lw, label='ROC curve (area=%0.2f)' % roc_auc)\n",
    "plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')\n",
    "plt.xlim([0.0, 1.0])\n",
    "plt.ylim([0.0, 1.05])\n",
    "plt.xlabel('False Positive Rate')\n",
    "plt.ylabel('True Positive Rate')\n",
    "plt.title('Receiver operating characteristic for False')\n",
    "plt.legend(loc=\"lower right\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "dcoUEakxeXKe"
   },
   "outputs": [],
   "source": [
    "# code for PR curve, for true values.\n",
    "precision, recall, thresholds = metrics.precision_recall_curve(y, scores)\n",
    "plt.figure()\n",
    "lw = 2\n",
    "plt.plot( recall, precision, color='darkorange',\n",
    "         lw=lw, label='Precision recall curve for True')\n",
    "plt.xlim([0.0, 1.0])\n",
    "plt.ylim([0.0, 1.05])\n",
    "plt.xlabel('Recall')\n",
    "plt.ylabel('Precision')\n",
    "plt.title('Precision Recall Curve for True')\n",
    "plt.legend(loc=\"lower right\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "both",
    "colab": {},
    "colab_type": "code",
    "id": "wx-hFytjwiLJ"
   },
   "outputs": [],
   "source": [
    "# code for PR curve, for false values.\n",
    "precision, recall, thresholds = metrics.precision_recall_curve(\n",
    "                                label_invert, scores_invert)\n",
    "print(precision.shape)\n",
    "print(recall.shape)\n",
    "\n",
    "plt.figure()\n",
    "lw = 2\n",
    "plt.plot( recall, precision, color='darkorange',\n",
    "          label='Precision recall curve for False')\n",
    "plt.xlim([0.0, 1.1])\n",
    "plt.ylim([0.0, 1.1])\n",
    "plt.xlabel('Recall')\n",
    "plt.ylabel('Precision')\n",
    "plt.title('Precision Recall Curve for False')\n",
    "plt.legend(loc=\"lower right\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "HAivzUjcVJgT"
   },
   "source": [
    "## **Cleaning up**\n",
    "\n",
    "To clean up all GCP resources used in this project, you can [delete the GCP\n",
    "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "sx_vKniMq9ZX"
   },
   "outputs": [],
   "source": [
    "# Delete model resource.\n",
    "tables_client.delete_model(model_name=model_name)\n",
    "\n",
    "# Delete dataset resource.\n",
    "tables_client.delete_dataset(dataset_name=dataset_name)\n",
    "\n",
    "# Delete Cloud Storage objects that were created.\n",
    "! gsutil -m rm -r gs://$BUCKET_NAME\n",
    "\n",
    "# If training model is still running, cancel it.\n",
    "automl_client.transport._operations_client.cancel_operation(operation_id)"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "purchase_prediction.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
